Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UInt128 improve division by 64 bit on x64 #99747

Merged
merged 8 commits into from
Oct 30, 2024

Conversation

Daniel-Svensson
Copy link
Contributor

@Daniel-Svensson Daniel-Svensson commented Mar 14, 2024

Adds a similar fast path as for decimal
#99212 for when divisor fits in 64 bit.

It should inprove performance quite a bit if the UInt128 is larger than 64bit.
I did not do any benchmarks but based on similar test for 96 by 32 division the new code should be at least 10ns faster.

Question:

  • Have you considered adding overloads for division (and maybe multiplication) against ulong/uint?
    That should allow specialized faster code for common scenarios without adding complicated code to the JIT to detect and handle it without lots of runtime checks for 64/32bit values.

Note: It might be possible to further improve performance for non x64 by calling into decimal`s Div96By32 or Div128By64 if divisor is lower than than the top 64 byte for all other platforms, but it would require measuring the performance.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 14, 2024
@Daniel-Svensson Daniel-Svensson changed the title UInt128 improve division ny 64 bit on x64 UInt128 improve division by 64 bit on x64 Mar 14, 2024
@tannergooding
Copy link
Member

This has build failures that need to be addressed. Looks to be warning due to the usage of an API marked preview.

@Daniel-Svensson
Copy link
Contributor Author

@tannergooding I added suppression for "CA2252 // This API requires opting into preview features" in source.

I assume there is a good reason why preview features are no longer allowed System.Private.CoreLib when it was previously allowed for "dogfooding preview features", so looking at enabling it for the project seemed to be out of question.

Is there a better way to solve it available to corelib / the dotnet runtime repro?

@jkotas
Copy link
Member

jkotas commented Jun 7, 2024

This API is marked with Preview attribute because it produces suboptimal code. Have you verified that this change actually improves performance?

In general, any change that is a performance optimization needs to come with perf numbers.

@Daniel-Svensson
Copy link
Contributor Author

Daniel-Svensson commented Jun 8, 2024

This API is marked with Preview attribute because it produces suboptimal code. Have you verified that this change actually improves performance?

It is not optimal, but it is quite good

As I understand it the major issue which prevents general usage in Math.DivRem is that constant detection (and maybe propagation) is not yet done. The generated code is can often be optimal, unless you use to many value tuple calls and it starts to spill to the stack.

As for preview

In general, any change that is a performance optimization needs to come with perf numbers.

Yes, see Uint128 division benchmarks below, the it should give a much better confidence than my original statement about it being at least 10ns faster. (it is almost 20ns or 9 times faster)

This PR

commit 60cc066

(I built the benchmarks as self contained and then copied over all dll's from (runtime\artifacts\bin\coreclr\windows.x64.Release)


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3593/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.4.24267.66
  [Host] : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=InProcess  Toolchain=InProcessEmitToolchain  

Method A_high A_low B Mean Error StdDev Allocated
DivisionByUInt64 0 7115061911280 1311768467463790321 2.278 ns 0.0052 ns 0.0044 ns -
DivisionByUInt128 0 7115061911280 1311768467463790321 2.760 ns 0.0035 ns 0.0029 ns -
DivisionByConstant 0 7115061911280 1311768467463790321 2.377 ns 0.0043 ns 0.0038 ns -
DivisionByUInt64 1 7115061911280 1311768467463790321 2.322 ns 0.0669 ns 0.0687 ns -
DivisionByUInt128 1 7115061911280 1311768467463790321 2.632 ns 0.0082 ns 0.0069 ns -
DivisionByConstant 1 7115061911280 1311768467463790321 2.439 ns 0.0136 ns 0.0120 ns -

NET 9 preview 4


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3593/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-preview.4.24267.66
  [Host] : .NET 9.0.0 (9.0.24.26619), X64 RyuJIT AVX2

Job=InProcess  Toolchain=InProcessEmitToolchain  

Method A_high A_low B Mean Error StdDev Allocated
DivisionByUInt64 0 7115061911280 1311768467463790321 2.417 ns 0.0048 ns 0.0045 ns -
DivisionByUInt128 0 7115061911280 1311768467463790321 2.725 ns 0.0094 ns 0.0078 ns -
DivisionByConstant 0 7115061911280 1311768467463790321 2.438 ns 0.0032 ns 0.0025 ns -
DivisionByUInt64 1 7115061911280 1311768467463790321 20.622 ns 0.1809 ns 0.1692 ns -
DivisionByUInt128 1 7115061911280 1311768467463790321 22.516 ns 0.4261 ns 0.3777 ns -
DivisionByConstant 1 7115061911280 1311768467463790321 22.324 ns 0.0258 ns 0.0215 ns -
Source code ```// See https://aka.ms/new-console-template for more information using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running;

Console.WriteLine("Hello, World!");

// Run benchmakrs using benchmarks switcher
BenchmarkSwitcher.FromAssembly(typeof(BenchmarkUInt128Division).Assembly).Run(args);

///


/// Benchmark UInt128 division using BenchmarkDotNet.
/// The property A uses Param and is a UInt128 with iether 0 or a random value in upper 64 bits.
/// The property B uses Param and is a UInt128 created from a 64bit value
///

[MemoryDiagnoser]
[InProcess]
public class BenchmarkUInt128Division
{
UInt128 _a;
UInt128 _b;
ulong aHi;
ulong aLo;

[Params(0, 1)]
public ulong A_high
{
    get => aHi;
    set
    {
        aHi = value;
        _a = new UInt128(aHi, aLo);
    }
}

[Params(0x00006789abcdef0)]
public ulong A_low

{
    get { return aLo; }
    set
    {
        aLo = value;
        _a = new UInt128(aHi, aLo);
    }
}

[Params(0x123456789abcdef1)]
public ulong B
{
    get => (ulong)_b;
    set => _b = new UInt128(0ul, value); 
}

[Benchmark]
public UInt128 DivisionByUInt64()
{
    return _a / B;
}

[Benchmark]
public UInt128 DivisionByUInt128()
{
    return _a / _b;
}

[Benchmark]
public UInt128 DivisionByConstant()
{
    return _a / 7;
}

}

</details>

@jkotas
Copy link
Member

jkotas commented Jun 9, 2024

I have done quick perf tests:

using System;
using System.Diagnostics;

var sw = new Stopwatch();

for (;;)
{
   sw.Restart();
   UInt128 sum = 0;
   for (UInt128 i = 1; i < 100_000_000; i++) sum += (UInt128)i / (UInt128)3;
   Console.WriteLine(sw.ElapsedMilliseconds);
}

It runs 1.3x slower with this change. Would it be better to apply this optimization only for non-zero higher bits?

…s known that upper is zero such as division by constant)
@Daniel-Svensson
Copy link
Contributor Author

Would it be better to apply this optimization only for non-zero higher bits?

I've changed the order so that the new code is only used as a fallback in case of non-zero higher bits.
It seems to improve your example from an average of 147 to 127 for me (plain preview.4 gives around 140)

My best guess is that the performance change comes from PGO based aggressive inlining in which case the standard division can get replaced with mul and shift, but the instrinct will not.

@tannergooding
Copy link
Member

Rerunning CI. Going to finish reviewing this today after CI finishes

@Daniel-Svensson
Copy link
Contributor Author

@tannergooding I have updated the suppressions for this PR and the decimal PR to handle the introduction of ExperimentalAttribute, so it compiles again.

@tannergooding
Copy link
Member

This looks good. But could you give some updated perf numbers prior to this getting merged?

@Daniel-Svensson
Copy link
Contributor Author

@tannergooding I reran above benchmarks with rc1 and the last commit of this branc

PR


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
  [Host] : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=InProcess  Toolchain=InProcessEmitToolchain  

Method A_high A_low B Mean Error StdDev Allocated
DivisionByUInt64 0 7115061911280 1311768467463790321 2.431 ns 0.0285 ns 0.0267 ns -
DivisionByUInt128 0 7115061911280 1311768467463790321 2.573 ns 0.0017 ns 0.0015 ns -
DivisionByConstant 0 7115061911280 1311768467463790321 2.415 ns 0.0018 ns 0.0014 ns -
ConstantDivision 0 7115061911280 1311768467463790321 2.457 ns 0.0047 ns 0.0044 ns -
DivisionByUInt64 1 7115061911280 1311768467463790321 18.494 ns 0.0213 ns 0.0178 ns -
DivisionByUInt128 1 7115061911280 1311768467463790321 19.700 ns 0.0721 ns 0.0675 ns -
DivisionByConstant 1 7115061911280 1311768467463790321 21.973 ns 0.0581 ns 0.0515 ns -
ConstantDivision 1 7115061911280 1311768467463790321 2.468 ns 0.0068 ns 0.0064 ns -

RC.1


BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.4317/23H2/2023Update/SunValley3)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 9.0.100-rc.1.24452.12
  [Host] : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX2

Job=InProcess  Toolchain=InProcessEmitToolchain  

Method A_high A_low B Mean Error StdDev Allocated
DivisionByUInt64 0 7115061911280 1311768467463790321 2.411 ns 0.0069 ns 0.0054 ns -
DivisionByUInt128 0 7115061911280 1311768467463790321 2.701 ns 0.0054 ns 0.0050 ns -
DivisionByConstant 0 7115061911280 1311768467463790321 2.421 ns 0.0074 ns 0.0069 ns -
ConstantDivision 0 7115061911280 1311768467463790321 2.457 ns 0.0012 ns 0.0009 ns -
DivisionByUInt64 1 7115061911280 1311768467463790321 19.131 ns 0.0063 ns 0.0059 ns -
DivisionByUInt128 1 7115061911280 1311768467463790321 19.655 ns 0.1543 ns 0.1444 ns -
DivisionByConstant 1 7115061911280 1311768467463790321 21.704 ns 0.0164 ns 0.0153 ns -
ConstantDivision 1 7115061911280 1311768467463790321 2.469 ns 0.0074 ns 0.0066 ns -

your benchmark program

Running with runtime win-x64 - .NET 9.0.0-rc.1.24431.7
1930
1889
1895
1899
1893

❯ .\dotnet.exe C:\Users\danie\source\repos\BenchmarkApp\bin\Release\net9.0\win-x64\BenchmarkApp.dll
Hello, World!
Running with runtime win-x64 - .NET 10.0.0-dev
1861
1857
1875
1879
1878

@tannergooding tannergooding merged commit c43fc89 into dotnet:main Oct 30, 2024
148 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants